We're releasing Ultravox v0.5 today. The weights have been pushed to Hugging Face. If you're using the Ultravox Realtime APIs, v0.5 is the new default.
What's New
v0.5 improves upon 0.4.1 in the following ways:
- 60% improvement in transcription accuracy, with lower word error rates (WER) across 82 evaluation sets from LibriSpeech, CommonVoice, and Fleurs.
- 18% improvement in speech-based web question answering, particularly in handling named entities and fine-grained speech details.
- 24% improvement in X-to-English translation, as measured by BLEU across 19 languages.
- Expanded language support from 15 to 42 languages, making it significantly more accessible for global applications.
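For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model's transcript, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation code used for the numbers above: the function name `word_error_rate` is hypothetical, and the sketch assumes whitespace tokenization with no text normalization (real evaluations usually normalize case and punctuation first).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: assumes whitespace tokenization and no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing from hypothesis
                    curr[j - 1] + 1,     # extra word in hypothesis
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better, and a value above 1.0 is possible when the hypothesis contains many extra words.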
42 Languages Supported
Arabic, Belarusian, Bengali, Bulgarian, Chinese, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hindi, Hungarian, Italian, Japanese, Latvian, Lithuanian, Macedonian, Marathi, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh.
Evals
Our primary evaluation is speech translation, measured by BLEU. New for v0.5, we also report Big Bench Audio, which measures general reasoning in response to audio input.
Ultravox 70B
Benchmark | Ultravox 0.4.1 70B | Ultravox 0.5 70B |
---|---|---|
covost2 en_ar | 19.64 | 20.21 |
covost2 en_de | 32.47 | 34.53 |
covost2 es_en | 40.76 | 43.29 |
covost2 ru_en | 45.07 | 48.99 |
covost2 en_ca | 37.58 | 40.01 |
covost2 zh_en | 17.98 | 21.37 |
big bench audio | 76.20 | 82.70 |
Ultravox 8B
Benchmark | Ultravox 0.4.1 8B | Ultravox 0.5 8B |
---|---|---|
covost2 en_ar | 12.28 | 12.99 |
covost2 en_ca | 29.94 | 31.54 |
covost2 en_de | 27.13 | 28.70 |
covost2 es_en | 39.16 | 40.19 |
covost2 ru_en | 39.65 | 42.13 |
covost2 zh_en | 14.55 | 17.22 |
big bench audio | 63.20 | 66.54 |
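To compare the two releases at a glance, the per-benchmark gains can be derived directly from the table values. The sketch below copies the scores from the tables above and computes the absolute and relative change for each row; the names `SCORES_70B`, `SCORES_8B`, and `gains` are illustrative and not part of the Ultravox codebase.

```python
# Scores copied from the tables above, as (v0.4.1, v0.5) pairs.
SCORES_70B = {
    "covost2 en_ar": (19.64, 20.21),
    "covost2 en_de": (32.47, 34.53),
    "covost2 es_en": (40.76, 43.29),
    "covost2 ru_en": (45.07, 48.99),
    "covost2 en_ca": (37.58, 40.01),
    "covost2 zh_en": (17.98, 21.37),
    "big bench audio": (76.20, 82.70),
}
SCORES_8B = {
    "covost2 en_ar": (12.28, 12.99),
    "covost2 en_ca": (29.94, 31.54),
    "covost2 en_de": (27.13, 28.70),
    "covost2 es_en": (39.16, 40.19),
    "covost2 ru_en": (39.65, 42.13),
    "covost2 zh_en": (14.55, 17.22),
    "big bench audio": (63.20, 66.54),
}


def gains(scores):
    """Return {benchmark: (absolute_gain, relative_gain_percent)}."""
    return {
        name: (new - old, 100.0 * (new - old) / old)
        for name, (old, new) in scores.items()
    }


if __name__ == "__main__":
    for size, scores in (("70B", SCORES_70B), ("8B", SCORES_8B)):
        print(f"Ultravox {size}")
        for name, (abs_gain, rel_gain) in gains(scores).items():
            print(f"  {name:<16} {abs_gain:+.2f} ({rel_gain:+.1f}%)")
```

Every row improves in both model sizes, with zh_en showing the largest relative gain in each table.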
Training
This version of Ultravox continues to use a frozen pre-trained Llama core (3.1 for 8B and 3.3 for 70B), but we've significantly increased the amount of training data and the overall training time. Training on 8xH100s takes roughly 100 hours for the 8B model and roughly 150 hours for the 70B model.
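As a rough guide for budgeting, the wall-clock figures can be converted to GPU-hours. This is back-of-the-envelope arithmetic only, assuming a single 8-GPU node with all GPUs busy for the whole run; the variable names are illustrative and the hour counts are the approximate values quoted above.

```python
GPUS = 8  # 8xH100, as stated above

# Approximate wall-clock training time in hours, per model size.
WALL_CLOCK_HOURS = {"8B": 100, "70B": 150}

# GPU-hours = number of GPUs x wall-clock hours (assumes full utilization).
gpu_hours = {model: GPUS * hours for model, hours in WALL_CLOCK_HOURS.items()}

for model, total in gpu_hours.items():
    print(f"Ultravox {model}: about {total} H100 GPU-hours")
```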
What's Changed
- Audio streaming training with masking by @saeeddhqan in #148
- Defining block size in UltravoxConfig, and solving assertions by @saeeddhqan in #157
- Gradio demo for real-time conversations with WebRTC by @freddyaboulton in #150
- Fix "AttributeError: 'NoneType' object has no attribute 'tokenizer'" by @farzadab in #173
- docs: update README.md by @eltociear in #174
- Update ultravox model and config for v0.5 by @farzadab in #276
New Contributors
- @saeeddhqan made their first contribution in #148
- @freddyaboulton made their first contribution in #150
- @eltociear made their first contribution in #174
Full Changelog: v0.4.1...v0.5